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Abstract 


Bayesian hierarchical clustering (BHC) is an ag- 
glomerative clustering method, where a proba¬ 
bilistic model is defined and its marginal like¬ 
lihoods are evaluated to decide which clusters 
to merge. While BHC provides a few advan¬ 
tages over traditional distance-based agglomer- 
ative clustering algorithms, successive evalua¬ 
tion of marginal likelihoods and careful hyper¬ 
parameter tuning are cumbersome and limit the 
scalability. In this paper we relax BHC into a 
non-probabilistic formulation, exploring small- 
variance asymptotics in conjugate-exponential 
models. We develop a novel clustering algo¬ 
rithm, referred to as relaxed BHC (RBHC), from 
the asymptotic limit of the BHC model that ex¬ 
hibits the scalability of distance-based agglom- 
erative clustering algorithms as well as the flex¬ 
ibility of Bayesian nonparametric models. We 
also investigate the reducibility of the dissimilar¬ 
ity measure emerged from the asymptotic limit of 
the BHC model, allowing us to use scalable algo¬ 
rithms such as the nearest neighbor chain algo¬ 
rithm. Numerical experiments on both synthetic 
and real-world datasets demonstrate the validity 
and high performance of our method. 


1 INTRODUCTION 


Agglomerative hierarchical clustering, which is one of the 
most popular algorithms in cluster analysis, builds a binary 


tree representing the cluster structure of a dataset (Duda 


et al. 2001|l. Given a dataset and a dissimilarity mea- 
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sure between clusters, agglomerative hierarchical cluster¬ 
ing starts from leaf nodes corresponding to individual data 
points and successively merges pairs of nodes with smallest 
dissimilarities to complete a binary tree. 


Bayesian hierarchical clustering (BHC) ( [Heller and| 
[Ghahrahmanij |2005| l is a probabilistic alternative of ag¬ 
glomerative hierarchical clustering. BHC defines a gen¬ 
erative model on binary trees and compute the probabil¬ 
ity that nodes are merged under that generative model to 
evaluate the (dis)similarity between the nodes. Since this 
(dis)similarity is written as a probability, one can nat¬ 
urally decide a level, where to stop merging according 
to this probability. Hence, unlike traditional agglomera¬ 
tive clustering algorithms, BHC has a flexibility to infer a 
proper number of clusters for given data. The source of 
this flexibility is Dirichlet process mixtures (DPM) ( jFergu-j 
1973| Antoniak 19741 used to define the generative 


model of binary trees. BHC was shown to provide a tight 


lower bound on the marginal likelihood of DPM (Heller 


[and Ghahrahmanij [2005 [ jWallach et al.j |2010| ) and to be an 
alternative posterior inference algorithm for DPM. How¬ 
ever, when evaluating the dissimilarity between nodes, one 
has to repeatedly compute the marginal likelihood of clus¬ 
ters and careful tuning of hyperparameters are required. 


In this paper, we study BHC when the underlying distribu¬ 
tions are conjugate exponential families. Our contributions 
is twofold. First, we derive a non-probabilistic relaxation 
of BHC, referred to as RBHC, by performing small vari¬ 
ance asymptotics, i.e., letting the variance of the underly¬ 
ing distribution in the model go to zero. To this end, we 
use the technique inspired by the recent work ( jKulis and 
[Jordan)[2012||Jiang et al.||2012| l, where the Gibbs sampling 
algorithm for DPM with conjugate exponential family was 
shown to approach a fc-means-like hard clustering algo¬ 
rithm in the small variance limit. The dissimilarity measure 
in RBHC is of a simpler form, compared to the one in the 
original BHC. It does not require careful tuning of hyper¬ 
parameters, and yet has the flexibility of the original BHC 
to infer a proper number of the clusters in data. It turns 
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out to be equivalent to the dissimilarity proposed in (Tel- 


garsky and Dasgupta 2012|l, which was derived in different 


perspective, minimizing a cost function involving Bregman 
information ([Banerjee et al. 2005 | l. Second, we study the 
reducibility ( [Bruynooghe 1978| l of the dissimilarity mea¬ 
sure in RBHC. If the dissimilarity is reducible, one can use 
nearest-neighbor chain algorithm ( [Bruynooghe 1978| l to 
build a binary tree with much smaller complexities, com¬ 
pared to the greedy algorithm. The nearest neighbor chain 
algorithm builds a binary tree of n data points in 0{n^) 
time and 0(n) space, while the greedy algorithm does in 
0{n^ logn) time and 0{n^) space. We argue is that even 
though we cannot guarantee that the dissimilarity in RBHC 
is always reducible, it satishes the reducibility in many 
cases, so it is hne to use the nearest-neighbor chain algo¬ 
rithm in practice to speed up building a binary tree using 
the RBHC. We also present the conditions where the dis¬ 
similarity in RBHC is more likely to be reducible. 


2 BACKGROUND 

We briefly review agglomerative hierarchical clustering, 
Bayesian hierarchical clustering, and Bregman clustering, 
on which we base the development of our clustering al¬ 
gorithm RBHC. Let X = be a set of n data 

points. Denote by [n] = {1, ... ,n} a set of indices. A 
partition 7r„ of [n] is a set of disjoint nonempty subsets 
of \n] whose union is [n]. The set of all possible parti¬ 
tions of [n] is denoted by n„ For instance, in the case of 
[5] = {1,2,3,4, 5}, an exemplary random partition that 
consists of three clusters is tts = {{1}, {2,4}, {3, 5}}; its 
members are indexed by c G tts. Data points in cluster c is 
denoted by {xi\i G c} for c G 7r„. Dissimilarity 

between cq G 7r„ and ci G 7r„ is given by d(co, ci). 


clustering where the distance between clusters are defined 
as the Euclidean distance between the centers of the clus¬ 
ters, one can reduce the time complexity to O^kn?) where 
k n ([Day and Edelsbrunner 1984)l. 


Algorithm 1 Greedy algorithm for agglomerative hierarchical 
clustering 

Input: X = d(-, •). 

Output: Binary tree. 

1: Assign data points to leaves. 

2: Compute d({i}, {j}) for all i,j G [n]. 

3: while the number of nodes > 1 do 
4: Find a pair (co, Ci ) with minimum d (co, ci). 

5: Merge c co U Ci and compute d(c, c') for all d 7 ^ c. 

6: end while 


Algorithm 2 Nearest neighbor chain algorithm 

Input: X = reducible d(-, ■). 

Output: Binary tree. 

1: while the number of nodes > 1 do 
2 : Pick any Co. 

3: Build a chain ci = nn(co),C 2 = nn(ci),..., where 

nn(c) argmin^/ d(c, c'). Extend the chain until d = 
nn(ci-i) and Ci_i = nn(ci). 

4: Merge c Ci U Ci-i . 

5: If i > 2, go to line 3 and extend the chain from Ci-2- 

Otherwise, go to line 2. 

6: end while 


2.2 Reducibility 


Two nodes Cq and ci are reciprocal nearest neighbors 
(RNNs) if the dissimilarity d(co, ci) is minimal among all 
dissimilarities from cq to elements in 7 r„ and also minimal 
among all dissimilarities from ci. Dissimilarity d(-, •) is 
reducible ([Bruynooghe |1978[l, if for any cq, ci, C 2 G 7r„, 


2.1 Agglomerative Hierarchical Clustering 

Given X and d(-, •), a common approach to building a bi¬ 
nary tree for agglomerative hierarchical clustering is the 
greedy algorithm, where pairs of nodes are merged as 
one moves up the hierarchy, starting in its leaf nodes 
(Algorithm!^. A naive implementation of the greedy al¬ 
gorithm requires 0{rd) in time since each iteration needs 
0{'n?) to And the pair of closest nodes and the algorithm 
runs over n iterations. It requires 0{rd) in space to store 
pairwise dissimilarities. The time complexity can be re¬ 
duced to 0{n? logn) with priority queue, and can be re¬ 
duced further for some special cases; for example, in single 
linkage clustering where 

d(co,ci)= min d({i},{j}), 

*6co jGci 

the time complexity is 0{n?) since building a binary tree 
is equivalent to finding the minimum spanning tree in the 
dissimilarity graph. Also, in case of the centroid linkage 


d(co,ci) < min{d(co,C 2 ),d(ci,C 2 )} 

^ min{d(co, C2), d(ci, C2)} < d(co U ci, C2). ( 1 ) 

The reducibility ensures that if cq and ci are reciprocal 
nearest neighbors (RNNs), then this pair of nodes are the 
closest pair that the greedy algorithm will eventually And 
by searching on an entire space. Thus, the reducibility 
saves the effort of finding a pair of nodes with minimal 
distance. Assume that (cq, ci) are RNNs. Merging (cq, ci) 
become problematic only if, for other RNNs (c 2 , 03 ), merg¬ 
ing ( 02 , 03 ) changes the nearest neighbor of cq (or oi) to 
02 U 03 . However this does not happen since 

d(c2,C3) < min{d(c2,co),d(c3,co)} 

^ min{d(c 2 ,Co),d(c 3 ,Co)} < d(c2 U 03,00) 

^ d(co,oi) < d(o2 U 03,00). 

The nearest neighbor chain algorithm ( [Bruynooghe[ |1978[ ) 
enjoys this property and And pairs of nodes to merge by 
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following paths in the nearest neighbor graph of the nodes 
until the paths terminate in pairs of mutual nearest neigh¬ 
bors (Algorithm!^. The time and space complexity of the 
nearest neighbor chain algorithm are O(n^) and 0(n), re¬ 
spectively. The reducible dissimilarity includes those of 
single linkage and Ward’s method ( Ward[ 1963|l. 


2.3 Agglomerative Bregman Clustering 


Agglomerative clustering with Bregman divergence (|Breg- 


_ 1967 1 as a dissimilarity measure was recently devel¬ 
oped in ( |Telgarsky and Dasgupta||2012| l, where the cluster¬ 
ing was formulated as the minimization of the sum of cost 
based on the Bregman divergence between the elements in 
a cluster and center of the cluster. This cost is closely re¬ 
lated with the Bregman Information used for Bregman hard 
clustering ( Banerjee et al.|[T005| l). In ( |TeIgarsky and Das^ 
gupta 2012 I, the dissimilarity between two clusters (cq, Ci) 


is dehned as the change of cost function when they are 
merged. As will be shown in this paper, this dissimilar¬ 
ity turns out to be identical to the one we derive from the 
asymptotic limit of BHC. Agglomerative clustering with 
Bregman divergence showed better accuracies than tradi¬ 
tional agglomerative hierarchical clustering algorithms on 
various real datasets (Telgarsky and Dasgupta 2012|l. 


2.4 Bayesian Hierarchical Clustering 

Denote by 7^ a tree whose leaves are Xc for c G 7r„. A 


binary tree constructed by BHC (Heller and Ghahrahmani 


|2005| ) explains the generation of with two hypotheses 
compared in considering each merge; (1) the first hypoth¬ 
esis T-Lc where all elements in were generated from a 
single cluster c; (2) the alternative hypothesis where X^. 
has two sub-clusters Xc^ and each of which is as¬ 
sociated with subtrees and Tci, respectively. Thus, the 
probability of Xc in tree Tc is written as; 


p{X,\%)=piHMXc\He) 

+ {i-p{h.)}p{x,Mp{XcATc,), 

where the prior p(7fc) is recursively defined as 

7{i} '= a, 7c ‘^= Q!r(|c|) -f 7co7ci, 

p{Hc) Q!r(|c|)/7c, 

and the likelihood of Xc under Tic is given by 

p{Xc\nc) f I Y[p{xi\e)\p{de). 


( 2 ) 

( 3 ) 

( 4 ) 

( 5 ) 


Now, the posterior probability of He, which is the proba¬ 
bility of merging (cq, ci), is computed by Bayes rule; 


p{Hc\Xc,Tc) = 


piHc)p{Xc\Hc 

p{Xc\Tc) 


(6) 


In ( |Lee and Choi[ [2014) 1, an alternative formulation for the 
generative probability was proposed, which writes the gen¬ 
erative process via the unnormalized probabilities (poten¬ 
tial functions); 


^iXc\nc) = aT{\c\)p{Xc\nc), 
cl){Xc\Tc)= -/cP{Xc\Tc). 
With these dehnitions, is written as 


( 7 ) 

( 8 ) 


^(XclTc) = HXc\Hc) + (l>iXc,\Tco)(l>iXc,\TcJ, (9) 


and the posterior probability of He is written as 


p{He\Xe,Te) = UT 


^{XeMHXeATe,) 

cl^iXelHe) 


■ ( 10 ) 


One can see that the ratio inside behaves as the dissimilarity 
between cq and ci; 

d(co,cO - -^(x:\He)^ .■ 

Now, building a binary tree follows Algorithm with the 
distance in ( [TT] i. Beside this, BHC has a scheme to deter¬ 
mine the number of clusters. It was suggested in ( |Helle r| 
|and Ghahrahmani 2005| l that the tree can be cut at points 
p{Hc\Xe, Tc) < 0.5. It is equivalent to say that the we stop 
the algorithm if the minimum over d(co, ci) is greater than 
1. Note that once the tree is cut, the result contains forests, 
each of which involves a cluster. 

BHC is closely related to the marginal likelihood of DPM; 
actually, the prior p{Hc) comes from the predictive distri¬ 
bution of DP prior. Moreover, it was shown that computing 
(l){Xc\Tc) to build a tree naturally induces a lower bound 
on the marginal likelihood of DPM, pdpm (Xc) (Heller and 
|Ghahrahmani| [2005] l ; 


r(«) 

r(a -I- n) 


(l>{Xc\Tc) < pdpm(Xc). 


( 12 ) 


Hence, in the perspective of the posterior inference algo¬ 
rithm for DPM, building tree in BHC is equivalent to com¬ 
puting the approximate marginal likelihood. Also, cutting 
the tree at the level where d(co, ci) > 1 corresponds hnd- 
ing the MAP clustering of X. 

In (Heller and Ghahrahmani 2005| l, the time complexity 
was claimed to be O(n^). However, this does not count the 
complexity required to hnd a pair with the smallest dissim¬ 
ilarity via sorting. For instance, with a sorting algorithm 
using priority queues, BHC requires 0{'n? logn) in time. 

The dissimilarity is very sensitive to tuning the hyperpa¬ 
rameters involving the distribution over parametesp(0) re¬ 
quired to compute p{Xc\Hc)- An EM-like iterative algo¬ 
rithm was proposed in ( [Heller and Ghahrahmani 20051 to 
tune the hyperparameters, but the repeated execution of the 
algorithm is infeasible for large-scale data. 
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2.5 Bregman Diverences and Exponential Families 

Definition 1. ( lliBregman\ 1 967\ ) Let (p be a strictly convex 
differentiable function defined on a convex set. Then, the 
Bregman divergence, y), is defined as 


For this scaled distribution, the mean /r remains the same, 
and the covariance S becomes S//3 ( [Jiang et ^ 2012 1 . 
Hence, the distribution is more concentrated around its 
mean. The scaled distribution in the Bregman divergence 
form is 


= p[x) - (p{y) - {x - y,\/p{y)). (13) 

Various divergences belong to the Bregman divergence. 
For instance. Euclidean distance or KL divergence is Breg¬ 
man divergence, when ip{x) = or (p(x) = x log x, 

respectively. 

The exponential family distribution over a: S with nat¬ 
ural parameter 0 G 0 is of the form: 

p(xl0) = exp^(t(x),d} - 'ip{e) - /i(a:)|, (14) 

where t{x) is sufficient statistics, ip{6) is a log-partition 
function, and exp{—/i(a:)} is a base distribution. We as¬ 
sume that p(x\d) is regular (© is open) and t{x) is mini¬ 
mal {$a G s.t. Vx, {a,t{x)) = const). Let (pifi) be 
the convex conjugate of ij;: 

(p(/r) = sup |(/r,0) -'i/;(0)|. (15) 

eee '■ ^ 


p{x\9) 

= exp|-/3B^(<(a:),/x)| 



exp|^(^(f(a;)) - hffx)^. 

(20) 

According to p{x\9), p{9\f, v) is dehned with r = 
u = V!fd. Actually, this yields the same prior as non¬ 
distribution. 

^ T/fd, 
scaled 

II 

*2. 

exp 1 {9, t ) - vil;{9) - r 

)} 

= 

exp 1 {9, t) - v%p{9) - A-, '^) 

}.(21) 


3 MAIN RESULTS 


We present the main contribution of this paper. From now 
on, we assume that the likelihood and prior in Eq. (0 are 
scaled exponential families dehned in SectionjZ^ 


Then, the Bregman divergence and the exponential family 
has the following relationship: 


Theorem 1. { Banerjee et al.\ 2005) Let pf) be the con¬ 
jugate function of iff). Let 0 be the natural parameter 
and pb be the corresponding expectation parameter, i.e., 
pL = E[f(a;)] = V'ijj{9). Thenp{x\9) is uniquely expressed 
as 


p{x\9) = exp| - B^(f(a:),/r)| 

exp ^p{t{x)) — h(£c)|. 


(16) 


The conjugate prior for 9 has the form: 

p{9\u,t) = exp|(T,0) - vtp{9) - (17) 

p{6\u,^t) can also be expressed with the Bregman diver¬ 
gence: 

p{9\iy,T) = exp{-uB^{T/o,fi)} 

xexp^vp{T/v) - ^{v,t)Y (18) 

2.6 Scaled Exponential Families 

Let 9 = (39, and f}{9) = (dip{9/(3) = f3'i(i{9). The scaled 


exponential family with scale (3 is dehned as follows (Jiang 
[etari[20T2] i: 

pix\9) = exp^{t{x),e)-fj(e)-hpix)''^ 

= ex.pi^(3{t{x),9) - Pi(i{9) - h/3{x)Yl9) 


3.1 Small-Variance Asymptotics for BHC 

The dissimilarity in BHC can be rewritten as follows: 

= - WM) - 

ar{\co\)r{\cMxjnco)p{x,,\neJ 

T{\c\)p{X,\ne) 

X |l-f d(coo, coi)||l-f d(cio,cii )},(22 ) 

where cq = cqo U cqi and ci = cio U cn. We hrst analyze 
the ieTLmp{Xf\'Hc), as in (Jiang et al. 2012|l. 


p{XffHf) = y I jp(^\v, T)d9 

= (3‘^J exp ^^(^9 ,t 3- I3'^t{x,)^ - {o 3- I3\c\)t(;{9) 

-'^hffx.i) - ^{ix,T)\d9 

i£c 

= (3^ exp Uv + I3\c\)p{pf) - ^ hp{xi) - i{v, t) 

^ i^c 

X j exp{-{v + l3\c\)B^{p,^, p))d9, (23) 


where 


def r + 


(24) 
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Note that fi = is a function of 9. The term inside 

the integral of Eq. ( |2T| l has a local minimum at 
and thus can be approximated by Laplace’s method; 


= P‘^exp 


{v + /3|c|)vj(Mc) - 

iGc 



27r \ 

i^ + l3\c\) 


fJ-) 

d6de^ 


_ 1 

{l + 0(r^)}-(25) 

ri=Mc 


It follows from this result that, as [3 —> oo, the asymptotic 
limit of dissimilarity d(co, ci) in ( | 22 ] l is given by 

ar(|co|)r(|ci|)p(X.o|1Hco)p(X,J1HcJ 

/3^oo r(|c|)p(X,|^,) 

« exp|/3(|co|<p(/Xco) + |ci|¥’(Mci) “ |c|<p(A‘c))}> 


Let a = P 2 exp(—/?A), then we have 
= ^l^exp|/3(|co|v5(Mco) + |ci|<p(Mci) - |c|</5(Mc) - ^)}- 

As /3 —oo, the term inside the exponent converges to 

\coW{ico) + \ci\ip(tci) - \c\ip{tc) - A, (26) 

where 

ic AVilat*), (27) 

1^1 


d*{-, •), and we can choose the next pair to merge by com¬ 
paring d*(-, •) instead of d(-, •). 

With Eqs. @ and we conclude that when /3 —> oo, 
BHC reduces to Algorithm with dissimilarity measure 
d*(-,-) and threshold A, where the algorithm terminates 
when the minimum d*(-, •) exceeds A. 


On the other hand, a simple calculation yields 

— I Co I (f Co ! ^c) “t“ I Cl I (foi, f c); 

which is exactly same as the dissimilarity proposed in (|Tel- 


garsky and Dasgupta |2012|l. Due to the close relationship 


between exponential family and the Bregman divergence, 
the dissimilarities derived from two different perspective 
has the same form. 


As an example, assume that p(x\6) = N{x\pL^a‘^I) and 
p(/i) = A/'(/x|0, p^I). We have p{x) = \\x\\^ /(2cr^) and 


d*(co,ci) 


|co||ci| Ha^cQ - a^cif 
2 cr 2 (|co| -p |ci|) 


( 31 ) 


which is same as the Ward’s merge cost ( |Ward 1963| l, ex¬ 
cept for the constant 1/(2 (t^). Other examples can be found 
in (Banerjee et al.[[2005|l. 


Note that d*(-, •) does not need hyperparameter tunings, 
since the effect of prior p{6) is ignored as /3 —0. This 
provides a great advantage over BHC which is sensitive to 
the hyperparameter settings. 


and this is the average of sufficient statistics for cluster c. 
With this result, we dehne a new dissimilarity d*(co, ci) as 


d*(co,ci) \co\p{tco) + \ci\p{tcP) - \c\pitc) 
= |col</5(?co) + |ci|<<5(fci) 

n III n l'"0 1^=0 d” I*"! I^ci ^ 


Smoothing: In some particular choice of p, the singleton 
clusters may have degenerate values (|Telgarsky and Das7 


gupta 20121. Lor example, when p(a;|0) = Mult(a;|m, g). 


the function p{x) = log{a;j /m) has degenerate 

values when xj = 0. To handle this, we use the smooth¬ 
ing strategy proposed in ( |Telgarsky and Dasgupta] |2012| l; 
instead of the original function (p{x), we use the smoothed 
functions ^nd ipi{x) dehned as follows; 


Note that d*(-, •) is always positive since (p is convex. If 
d*(co, Cl) > A, the limit Eq. ( |26] l diverges to oo, and con¬ 
verges to zero otherwise. When |c| = 1, Eq. ( |2^ is the 
same as the limit of the dissimilarity d(co, ci), and thus the 
dissimilarity diverges when d*(co,ci) > A and converges 
otherwise. When |c| > 1, assume that the dissimilarities 
of children d(coo,coi) and d(cio,Cii) converges to zero. 
Erom Eq. ( |2^ , we can easily see that d(co,ci) converges 
only if d*(co, ci) < A. In summary. 


lim d(co,ci) 

p—^OQ 


0 if d*{co,ci) < A, 
oo otherwise. 


(29) 


In similar way, we can also prove the following: 

d(co,Ci) ^ I 0 if d*(co,ci) < d*(c2,C3), 
^—>■00 d(c 2 ,C 3 ) \ oo otherwise. ’ 


which means that comparing two dissimilarities in origi¬ 
nal BHC is equivalent to comparing the new dissimilarities 


po{x) = ip{{\ - a)x + (32) 

pi[x)'^= p{x + a'y), (33) 

where a G ( 0 , 1 ) be arbitrary constant and 7 must in the 
relative interior of the domain of p. In general, we use 
Po{x) a smoothed function, but we can also use pi{x) 
when the domain of is a convex cone. 


Heuristics for choosing A: As in (Kulis and Jordan 2012| |, 
we choose the threshold value A. Eortunately, we found that 
the clustering accuracy was not extremely sensitive to the 
choice of A; merely selecting the scale of A could result 
in reasonable accuracy. There can be many simple heuris¬ 
tics, and the one we found effective is to use the A:-means 
clustering. With the very rough guess on the desired num¬ 
ber of clusters k, we first run the A:-means clustering (with 
Euclidean distance) with k = ak (we fixed a — A for all 
experiments). Then, A was set to the average value of dis¬ 
similarities d*(-, •) between the all pair of k centers. 
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3.2 Reducibility of d*(-, •) 

The relaxed BHC with small-variance asymptotics still has 
the same complexities to BHC. If we can show that d*(', •) 
is reducible, we can reduce the complexities by adapt¬ 
ing the nearest neighbor chain algorithm. Unfortunately, 
d*(-,-) is not reducible in general (one can easily find 
counter-examples for some distributions). However, we ar¬ 
gue that d*(-, •) is reducible in many cases, and thus ap¬ 
plying the nearest neighbor chain algorithm as if d*(-, •) is 
reducible does not degrades the clustering accuracy. In this 
section, we show the reason by analyzing d*(-, •). 

At first, we investigate a term inside the dissimilarity; 


/(u) \c\ip{tc). 

(34) 

The second-order Taylor expansion of this function around 
the mean /r yields; 

fitc) = \c\^pi^l) + \c\ip^^'> (tc - p) 

(35) 

where is the fcth order derivative of (p, and 


y (u - - p.), 

(36) 

,T N def 1 1 d°‘ip{l>) - 

— |c| / ^ \tc M) • 

(37) 


|a |=3 


Here, a is the multi-index notation, and v = /r -|- fc(fc — /r) 
for some k G (0,1). The term A^(tc) plays an important 
role in analyzing the reducibility of d*(-, •). To bound the 
error term e^(tc), we assume that < iwQ As earlier, 
assume that U is a average of |c| observations generated 
from the same scaled-exponential family distribution; 

a;i,... ,£C„ ~ 0), ic = (38) 

c 

' ' zee 


Now, let Co and ci be clusters belong to the same super¬ 
cluster (i.e. JCco and were generated from the same 
mean vector /r). We don’t need to investigate the case 
where the pair belong to a different cluster, since then they 
will not be merged anyway (d*(-, •) > A) in our algorithm. 
By the independence, E[fco] = E[fci] = E[fc„uci] = M- 
Applying the approximation ( |35] ), we have 

ti* (CQ 5 Cl ) ~ Aip (tco ) (fci ) Aip (f CqUCi ) 

= 2(|coI+Tci|) (^^0 - -icj- (41) 


This approximation, which we will denote as d*(co,ci), 
is a generalization of the Ward’s cost pT] ) from Euclidean 
distance to Mahalanobis distance with matrix (/r) (note 
that this approximation is exact for the spherical Gaussian 
case). More importantly, d,^(co, Ci) is reducible. 

Theorem 2. 


d*(co,ci) < min|d*(co,C 2 ),d*(ci,C 2 )| 
iiin |d*(co, C 2 ), d*(ci, C 2 )| < d,^(co U Ci, C 2 ). 


Proof. For the Ward’s cost, the following Lance-Williams 
update formula ( Lance and Williams[ |1967| l holds for 


d*(co U ci,C 2 ) = 


(|co| -f |c 2 |)d*(co,C 2 ) 


-f 


|co| + |ci| -I- |C 2 | 

(|ci| + |c2|)d,^(ci,C2) - |c2|d,t(co,Ci) 
|co| + |ci| + |c 2 | 

Hence, by the assumption, we get 

d*(co Uci,C 2 ) > min |d*(co, C 2 ), d*(ci, C 2 )|. 


. (42) 


□ 


By the property of the log-partition function of the expo¬ 
nential family distribution, we get the following results; 


E[e^(U)] 


1 


E 


a! 


(39) 


One can see that the expected error converges to zero as 
/3^|c| —> 00 . Also, it can be shown that the expectation of 
the ratio of two terms converges to zero as fl ^ 00 : 


lim E 

^—>■00 


^ip (^c) 


= 0 , 


(40) 


As a result, the dissimilarity d*(-, •) is reducible provided 
that the Taylor’s approximation ( |35| ) is accurate. In such a 
case, one can apply the nearest-neighbor chain algorithm 
with d*(-, •), treating as if it is reducible to build a binary 
tree in 0{nf) time and 0{n) space. Unlike Algorithmic 
we have a threshold A to determine the number of clusters, 
and we present a slightly revised algorithm (Algorithmic. 
Note again that the revised algorithm generates forests in¬ 
stead of trees. 

4 EXPERIMENTS 


which means that A^{tc) asymptotically dominates e^{tc) 
(detailed derivations are given in the supplementary mate¬ 
rial). Hence, we can safely approximate f{tc) up to second 
order term. 

'This assumption holds for the most of distributions we will 
discuss (if properly smoothed), hut not holds in general. 


4.1 Experiments on Synthetic Data 

Testing the reducihility of d*(-, ); We tested the re¬ 
ducibility of d*(-, •) empirically. We repeatedly generated 
the three clusters co,ci and C 2 from the exponential fam¬ 
ily distributions, and counted the number of cases where 
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Figure 1: Average relative error vs maximum cluster size and scale factor /3. 


Table 2: Average adjusted Rand index values for randomly generated datasets. Best ones are marked as bold face. 



Poisson (k) 

Poisson (2k) 

multinomial (k) 

multinomial (2k) 

Gaussian (k) 

Gaussian (2k) 

single linkage 
complete linakge 
Ward’s method 
RBHC-greedy 
RBHC-nnca 
BHC 

0.091 (0.088) 
0.381 (0.091) 
0.465 (0.119) 
0.469 (0.112) 
0.469 (0.112) 
0.265 (0.080) 

0.015 (0.030) 
0.263 (0.059) 
0.273 (0.049) 
0.290 (0.056) 
0.290 (0.056) 
0.134 (0.052) 

0.000 ( 0 . 000 ) 
0.266 (0.144) 
0.770 (0.067) 
0.870 (0.046) 
0.865 (0.045) 
0.907 (0.069) 

0.000 ( 0 . 000 ) 
0.090 (0.025) 
0.564 (0.055) 
0.733 (0.028) 
0.736 (0.040) 
0.894 (0.044) 

0.270 (0.280) 
0.565 (0.166) 
0.779 (0.145) 
0.875 (0.087) 
0.875 (0.087) 
0.863 (0.109) 

0.057 (0.055) 
0.475 (0.141) 
0.763 (0.122) 
0.883 (0.067) 
0.883 (0.067) 
0.860(0.108) 


Algorithm 3 Nearest neighbor chain algorithm for BFIC with 
small-variance asymptotics 

Input: X = d*(-, ■), A. 

Output: A clustering C. 

1: Set R — [n] and (7 = 0. 

2: while R ^ 0 do 
3: Pick any cq G R- 

4: Build a chain ci = nn(co),C 2 = nn(ci),..., where 

nn(c) argmin^/dsfor(c, c'). Extend the chain until 
a = nn(ci_i) and Ci_i = nn(ci). 

5: Remove d and d-i from R. 

6 : if d*(ci-i, Ci) < A then 

7: Add c = Ci U Ci_i to R. 

8 : If i > 2, go to line 3 and extend the chain from Ci_ 2 . 

Otherwise, go to line 2. 

9: else 

10: Add Ci and Ci_i to C. 

11: end if 
12: end while 


the dissimilarities between those clusters are not reducible. 
We also measured the average value of the relative error 
to support our arguments in Section 3.2 We tested three 
distributions; Poisson, multinomial and Gaussian. At each 
iteration, we first sampled the size of the clusters |co|, 
|ci| and |c 2 | from Unif([20,100]). Then we sampled the 
three clusters from one of the three distributions, and com¬ 
puted d*(co, Cl), d*(ci, C 2 ) and d*(co,C 2 ). We then first 
checked whether these three values satisfy the reducibility 
condition Q (for example, if d*(co, C 2 ) is the smallest, we 
checked if d*(co U C 2 , ci) > min{d*(co, ci), d*(c 2 , ci)}). 
Then, for d*(co,ci), we computed the approximate value 


Table 1: Average value of the exact dissimilarity d*, approximate 
dissimilarity d* and relative error, and number of not-reducible 
case among 100,000 trials. 



E[d*] 

E[d*] 

average 
relative 
error (%) 

not 

reducible 

/lOOfc 

Poisson 

0.501 

0.500 

1.577 

0 

Multinomial 

3.594 

3.642 

4.675 

87 

Gaussian 

8.739 

8.717 

7.716 

48 


d*(co, Cl) and measured the relative error 

d*(co,ci) -|-d*(co,ci) 

We repeated this process for 100,000 times for the three 
distributions and measured the average values. For Poisson 
distribution, we sampled the mean p ^ Gamma(2, 0.05) 
and sampled the data from Poisson(/9). We smoothed the 
function i^(a:) = a;log(a;)—a: as 4-0.01) to prevent de¬ 
generate function values. For multinomial distribution, we 
tested the case where the dimension d is 10 and the number 
of trials m is 5. We sampled the parameter q ~ Dir(5 • 1^) 
where is d-dimensional one vector, and sampled the 
data from Mult(< 7 ). We smoothed the function (p{x) = 
J2‘j=i ^og{xj/m) as (^(0.9a; -F O.lmld/d). For Gaus¬ 
sian, we tested with d = 10, and sampled the mean and 
covariance /x, S ~ A/’(0, (0.08 • A)~^)W{d + 2, fP) (fP = 
AA^ +dl where A was sampled from unit Gaussian). We 
smoothed the function ip{x, X) = —| log det(X — xx^) 
as — i logdet(X — xx^ + O.OIJ). 































BHC with Exponential Family: Small-Variance Asymptotics and Reducibility 


The result is summarized in Table the generated dissim¬ 
ilarities were reducible in most case, as expected. The rel¬ 
ative error was small, which supports our arguments of the 
reason why d,^(', •) is reducible with high probability. We 
also measured the change of average relative error by con¬ 
trolling two factors; the maximum cluster size rimax and 
variance scale factor /3. We plotted the average relative er¬ 
ror of Gaussian distribution by changing those two factors, 
and the relative error decreased as predicted (Figure[^. 

Clustering synthetic data; We evaluated the cluster¬ 
ing accuracies of original BHC (BHC), BHC in small- 
variance limit with greedy algorithm (RBHC-greedy), 
BHC in small-variance limit with nearest neighbor chain 
method (RBHC-nnca), single linkage, complete linkage, 
and Ward’s method. We generated the datasets from Pois¬ 
son, multinomial and Gaussian distribution. We tested two 
types of data; 1,000 elements with 6 clusters and 2,000 
elements with 12 clusters. For Poisson distribution, each 
mixture component was generated from Poisson(p) with 
p ~ Gamma(2, 0.05) for both datasets. For multinomial 
distribution, we set d = 20 and m = 10 for 1,000 ele¬ 
ments dataset, and set d = 40 and m = 10 for 2,000 
elements datasets. For both dataset, we sampled the pa¬ 
rameter q ~ Dir(0.5 • 1^). For Gaussian case, we set 
d = 3 for 1,000 elements dataset and set d = 6 for 
2,000 elements dataset. We sampled the parameters from 
JV{0, (0.08 • 6) for 1,000 elements, and from 

JV{0, (0.2 • A)“^)>V(\P, 9) for 2,000 elements. For each 
distribution and type (1,000 or 2,000), we generated 10 
datasets for each type and measured the average clustering 
accuracies. 


We evaluated the clustering accuracy using the adjusted 
Rand index ( [Hubert and Arabi^ |1985| l. For traditional 
agglomerative clustering algorithms, we assumed that we 
know the true number of clusters k and cut the tree at 
corresponding level. For RBHC-greedy and RBHC-nnca, 
we selected the threshold A with the heuristics described 
in Section 3.1 For original BHC, we have to carefully 
tune the hyperparemters, and the accuracy was very sen¬ 
sitive to this setting. In the case of Poisson distribution 
where p ~ Gamma(a, b), we have to tune two hyperpa¬ 
rameters {a, b}. For multinomial case where q ~ Dir(Q:), 
we set a = aid and tuned a. For Gaussian case where 
(/X, S) ~ A/'(m, (rA)“^)>V(;x,we have four hy¬ 
perparameters {m, r, ly, fP}. We set m to be the empirical 
mean of X and fixed r = 0.1 and i/ = d + 6. We set 
= kS where S is the empirical covariance of X and 
controlled k according to the dimension and the size of the 
data. The result is summarized in Table |2] 


The accuracies of RBHC-greedy and RBHC-nnca were 
best for most of the cases, and the accuracies of the two 
methods were almost identical expect for the multinomial 
distribution. BHC was best for the multinomial case where 
the hyperparameter tuning was relatively easy, but showed 


poor performance in Poisson case (we failed to find the best 
hyperparameter setting in that case). Hence, it would be a 
good choice to use RBHC-greedy or RBHC-nnca which do 
not need careful hyperparameter tuning, and RBHC-nnca 
may be the best choice considering its space and time com¬ 
plexity compared to RBHC-greedy. 


4.2 Experiments on Real Data 


We tested the agglomerative clustering algorithms on two 
types of real-world data. The first one was a subset of 


MNIST digit database (LeCun et al. 19981. We scaled 


down the original 28x28 to 7x7 and vectorized each image 
to be vector. Then we sampled 3,000 images from the 
classes 0, 3, 7 and 9. We clustered this dataset with Gaus¬ 
sian asssumption. The second one was visual-word data ex¬ 
tracted from CaltechlOl database ( jFei-Fei et al.[|200^ . We 
sampled 2,033 images from ’’Airplane”, ’’Mortorbikes” and 
”Faces-easy” classes, and extracted SIFT features for im¬ 
age patches. Then we quantized those features into 1,000 
visual words. We clustered the data with multinomial as¬ 
sumption. Table shows the ARI values of agglomera¬ 
tive clustering algorithms. As in the synthetic experiments, 
the accuracy RBHC-greedy and RBHC-nnca were identi¬ 
cal, and outperformed the traditional agglomerative clus¬ 
tering algorithms. BHC was best for CaltechlOl, where the 
multinomial distribution with easy hyperparameter tuning 
was assumed. However, BHC was even worse than Ward’s 
method for MNIST case, where we failed to tune 49 x 49 
matrix 


Table 3: Average adjusted Rand index values for MNIST and Cal¬ 
techlOl datasets. Best ones are marked as bold face. 



MNIST 

CaltechlOl 

single linkage 

0.000 

0.000 

complete linakge 

0.187 

0.000 

Ward’s method 

0.485 

0.465 

RBHC-greedy 

0.637 

0.560 

RBHC-nnca 

0.637 

0.560 

BHC 

0.253 

0.646 


5 CONCLUSIONS 

In this paper we have presented a non-probabilistic counter¬ 
part of BHC, referred to as RBHC, using a small variance 
relaxation when underlying likelihoods are assumed to be 
conjugate exponential families. In contrast to the original 
BHC, RBHC does not requires careful tuning of hyperpa¬ 
rameters. We have also shown that the dissimilarity mea¬ 
sure emerged in RBHC is reducible with high probability, 
so that the nearest neighbor chain algorithm was used to 
speed up the RBHC and to reduce the space complexity, 
leading to RBHC-nnca. Experiments on both synthetic and 
real-world datasets demonstrated the validity of RBHC. 
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BHC with Exponential Family: Small-Variance Asymptotics and Reducibility 


A Detailed Derivation in Section 13.21 

For Xc = {jcJiGc ~ p{x\P, 0), we have 


p(.tc\/3,0) = exp j/?|c|(ic,0) - I3\c\tp{e) -'^hp{xi)\. 

^ i£c ^ 

For Rotational simplicity, we let tc = y from now. By the normalization property, 

P\c\ip{e) = log j exp|/3|c|(y,0) 


(44) 


(45) 


Differentiating both sides by 6 yields 


= JI3\c\y ■ p{y\l3,e)dy, = E[j/]. 

Also, we have 

= y/3|c|yiF(y|/3,6>)(^/3|c|% 

= ^^|cpE[?/^-yfc] - ^2|c|2E[y]^-E[y]fc = \c\^ coy{ yj,yk). 

Hence, 

1 d‘^'ib{0^ f 

]^\Wl)ek ^ ^ J -E[y]j)(2/fc -E[y]fc)p(y|/3,e)dt/jdt/fe. 


Differentiating this again yields 

1 d^-ip^O) 
d9jd9kdei 


J (% - E[y]j)(yfe - E[y]fc)(yi - E[y]i)p{y\/3, e)dyjdykdyi 
E[{yj - E[y]j){yk - E[y]k)iyi - E[y]i)]. 


(46) 


(47) 


(48) 


(49) 


Unfortunately, this relationship does not continue after the third order; the fourth derivative of is not exactly match 
to the fourth order central moment of y. However, one can easily maintain the mth order central moment by manipulating 
the TOth order derivative of and mth order central moment always have the constant term 

Equation (40) of the paper is a simple consequence of the equation ( |49] l. To prove the equation (41) of the paper, we use 
the following relationship: 


E 


C-tpiXc') 

^ip{Xc) 


E[ey(a;e)] _ cov(ey,(a;e), Ay(tCe)) E[ey(a;e)]var[Ay(3;^)] 

E[Ay(Sc)] E[Ay(ic)]^ E[Ay(ic)]^ 


(50) 


Now it is easy to show that this equation converges to zero when /3 —0; all the expectations and variances can be obtained 
by differentiating '0(^) for as many times as needed. 

















